Introduction

Within the housing industry, there is a subset that rents out housing to other people. AirBnb, a large company in this area, has asked us to help them be able to predict rental prices given recent fears about a crash. We will provide AirBnb with a working model to help predict rental prices. For this scenario, our object of analysis will be an individual AirBnb rented home in the San Francisco Area, the population will be all AirBnb rentals in the San Francisco area from about 2012-present day, and our response variable will be the price of the AirBnb rental. All data collected in our sample was done by Dr. Greatrex.

Our initial thoughts believe that there are potentially 2 factors driving rental prices. One pertains to external factors separate from the residence such as crime. The other pertains to the actual house itself such as being able to work from home, number of reviews, and more.

Variable Name Description
host_name The name of the person renting out the room/house
neighborhood neighborhood name in San-Francisco
room_type Entire home or individual room
price The price that the airbnb rented for per night in USD
minimum_nights The minimum number of nights that a guest must stay in order to book
number_of_reviews Total number of reviews for that property
review_per_month Average number of reviews per month
Number_Listings_by_Host One means that this is the only airbnb rented by that host (e.g. likely their own home), several means that they might be a professional short term letter
availability_365 How many days the property is available during each year
num_trees.500m Number of trees within 500m of the property
num_bikehire.500m Number of bike rental stations within 500m of the property (a measure of tourism)
num_public_art.500m Number of public art works and murals within 500m of the property (a measure of tourism)
num_burglary.500m Number of burglaries within 500m of the property during that month/year
num_motor_theft.500 Number of car thefts within 500m of the property during that month/year
CensusTract_GEOID The census tract that the property is situated in
CT_population_density.km2 The population density in the census tract/neighborhood (people per square km)
CT_median_age The median age in the census tract/neighborhood (years)
CT_percent.incomeGt75E The percent of people who make more than $75,000 per year in that tract/neighborhood (years) e.g. measure of wealth
CT_percent.under18 The percentage of people under the age 18 in the census tract/neighborhood
CT_percent.over65 The percentage of people over age 65 in the census tract/neighborhood
CT_percent.poverty The percent of people in poverty in the census tract/neighborhood
CT_percent.foreignborn The percent of foreign born people in the census tract/ neighborhood
CT_percent.workhome The percent of people who work from home in that tract/neighborhood
CT_percent.withdegree The percent of people with a degree in that tract/neighborhood
CT_percent.collegestudents The percent of college students in that tract/neighborhood
CT_gini_inequalityindex The Gini inequality index in that tract/neighborhood
CT_median.housevalue The median house value in that tract/neighborhood
CT_median.rent The median rent in that tract/neighborhood
CT_percenthouse.rented The percentage of house rented out in that tract/neighborhood
CT_percenthouse.vacant The percentage of vacant houses in that tract/neighborhood
CT_percenthouse.broadband The percentage of houses with broadband internet in that tract/neighborhood
Longitude degrees
Latitude degrees

Data Wrangling

room <- read_excel("AirBnb_ROOM_sample_jxh6215.xlsx")
## New names:
## • `` -> `...1`
house <- read_excel("AirBnb_HOUSE_sample_jxh6215.xlsx")
## New names:
## • `` -> `...1`
room.sf <- st_as_sf(room,coords=c("Longitude","Latitude"),crs=4326)
house.sf <- st_as_sf(house,coords=c("Longitude","Latitude"),crs=4326)
skim(room)
Data summary
Name room
Number of rows 150
Number of columns 34
_______________________
Column type frequency:
character 6
numeric 28
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
…1 0 1 1 3 0 150 0
host_name 0 1 1 16 0 126 0
neighbourhood 0 1 6 21 0 31 0
room_type 0 1 11 11 0 1 0
Number_Listings_by_Host 0 1 3 7 0 2 0
CensusTract_GEOID 0 1 11 11 0 105 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
price 0 1 204.15 115.81 33.00 123.50 174.00 259.50 725.00 ▇▆▂▁▁
minimum_nights 0 1 17.90 17.94 1.00 2.00 30.00 30.00 120.00 ▇▇▁▁▁
number_of_reviews 0 1 57.13 101.16 1.00 4.00 10.00 49.00 450.00 ▇▁▁▁▁
reviews_per_month 0 1 1.24 1.67 0.02 0.16 0.42 1.68 7.67 ▇▁▁▁▁
availability_365 0 1 155.38 117.05 0.00 44.75 145.50 276.75 365.00 ▇▅▅▅▃
num_trees.500m 0 1 22.58 13.81 0.00 12.25 20.00 30.00 74.00 ▆▇▅▁▁
num_bikehire.500m 0 1 51.00 48.32 0.00 13.25 38.00 71.50 224.00 ▇▂▂▁▁
num_burglary.500m 0 1 7.71 6.92 0.00 2.00 6.00 11.00 29.00 ▇▆▂▁▁
num_public_art.500m 0 1 10.49 15.55 0.00 1.00 3.00 17.00 69.00 ▇▂▁▁▁
num_motor_theft.500 0 1 8.32 7.67 0.00 3.00 7.00 11.00 43.00 ▇▃▁▁▁
CT_population_density.km2 0 1 12476.83 10001.21 2236.18 7430.65 10017.02 13831.31 75415.06 ▇▁▁▁▁
CT_median_age 0 1 40.30 6.03 30.00 35.80 40.30 43.40 68.30 ▆▇▃▁▁
CT_percent.incomeGt75E 0 1 0.41 0.14 0.04 0.32 0.41 0.52 0.69 ▂▂▇▆▃
CT_percent.under18 0 1 0.13 0.05 0.01 0.10 0.13 0.16 0.28 ▂▆▇▃▁
CT_percent.over65 0 1 0.16 0.07 0.02 0.11 0.15 0.21 0.55 ▅▇▂▁▁
CT_percent.poverty 0 1 0.45 0.11 0.24 0.36 0.44 0.52 0.71 ▃▇▇▅▁
CT_percent.foreignborn 0 1 0.30 0.12 0.10 0.21 0.28 0.37 0.74 ▇▇▅▂▁
CT_percent.workhome 0 1 0.15 0.06 0.00 0.10 0.14 0.19 0.33 ▂▇▇▆▁
CT_percent.withdegree 0 1 0.59 0.19 0.11 0.49 0.61 0.74 0.94 ▂▂▇▇▅
CT_percent.collegestudents 0 1 0.04 0.03 0.00 0.02 0.03 0.05 0.19 ▇▅▁▁▁
CT_gini_inequalityindex 0 1 0.47 0.05 0.35 0.43 0.46 0.51 0.61 ▁▇▇▅▁
CT_median.housevalue 0 1 690.53 470.81 0.00 368.25 636.50 896.00 2273.00 ▆▇▂▁▁
CT_median.rent 0 1 2479.16 654.57 508.00 2053.00 2525.50 2984.25 3501.00 ▁▁▆▇▆
CT_percenthouse.rented 0 1 0.53 0.18 0.04 0.41 0.54 0.66 0.93 ▂▃▇▇▂
CT_percenthouse.vacant 0 1 0.11 0.07 0.00 0.06 0.10 0.16 0.40 ▇▇▅▁▁
CT_percenthouse.broadband 0 1 0.82 0.09 0.53 0.78 0.85 0.89 0.98 ▁▂▃▇▃
Longitude 0 1 -122.43 0.03 -122.51 -122.45 -122.43 -122.41 -122.39 ▂▃▆▇▆
Latitude 0 1 37.76 0.02 37.71 37.75 37.76 37.78 37.81 ▃▃▇▇▅

Response Variable Analysis

gghistostats(price,data=room,results.subtitle = FALSE)

You can see that we do not have any data that is missing from this file. You can also see that the response variable, price, seems to be slightly skewed. We will run a correlation matrix next to narrow down our predictor variables that we will use. We will also make a tmap in order to identify any noticeable trends regarding geo-location and price.

Correlation Matrix

room.numeric <- room[ , sapply(room,is.numeric)]
corrplot(cor(room.numeric),
         method="ellipse",
         type="lower",
         tl.cex = 0.5,
         number.cex =1,
         tl.srt = 45,
         mar= c(0,0,0,0),
         diag = FALSE)

Tmap

tmap_mode("view")
## tmap mode set to interactive viewing
qtm(room.sf,dots.col="price",title="AirBnB Price",dots.size=.5,dots.palette="Blues")

Of the 34 variables that we are analyzing, 28 of those variables are numeric and 6 are categorical. The correlation matrix conducted suggests that working from home and minimum nights the tenant has to stay seems to have the highest correlation coefficients of the other variables. There also seems to be some overlap between variables as the have correlations with each other. These include crime statistics correlating to median house value and so on. Some variables being related to each other is expected and so we will try to use variables that are no as correlated to other variables. It is also important to note that the location map does not seem to show any true patterns between the location and price.

Simple Linear Regression

We now will make and compare two single predictor, simple linear regression models. The first variable that we will look at is the percentage of people that work from home. This variable showed a pretty good correlation factor with only being related to a few other variables. The other variable that we will take a look at is the percentage of people making more than $75,000. This also showed a good correlation and will be interesting to model.

Model 1: Percentage of People that Work at Home

Scatter plot

percent_home__plot <- ggplot(room, aes(x=CT_percent.workhome, y=price)) +
  geom_point() +
  geom_smooth(method=lm , color="red", se=FALSE) +
  theme_ipsum() +
  xlab("Percentage Working from Home") +
  ylab("Price") +
  ggtitle("Scatterplot Comparing Price to Home Work Percentage")
percent_home__plot
## `geom_smooth()` using formula = 'y ~ x'

Linear Model

LM_PercentHome <- lm(price~CT_percent.workhome,data = room)
LM_PercentHome
## 
## Call:
## lm(formula = price ~ CT_percent.workhome, data = room)
## 
## Coefficients:
##         (Intercept)  CT_percent.workhome  
##               80.63               832.71

\[ \hat{price} = 80.63 +832.71(PercentWorkingFromHome) \]

LINE Assumption Analysis

ols_plot_resid_fit(LM_PercentHome)

ols_plot_resid_qq(LM_PercentHome)

ols_plot_resid_hist(LM_PercentHome)

ols_test_normality(LM_PercentHome)
## Warning in ks.test.default(y, "pnorm", mean(y), sd(y)): ties should not be
## present for the Kolmogorov-Smirnov test
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.907          0.0000 
## Kolmogorov-Smirnov        0.1019         0.0885 
## Cramer-von Mises         12.7925         0.0000 
## Anderson-Darling          2.5881         0.0000 
## -----------------------------------------------

It is clear that this model fails the LINE assumption tests. The model may not be linear and the errors are not normally distributed or of equal variance. We will apply a transformation to try and fix this model.

Transformations

#adding in transformation columns
room$log_workhome <- log(room$CT_percent.workhome)
room$log_price <- log(room$price)
room$sqrt_workhome <- sqrt(room$CT_percent.workhome)
room$inv_workhome <- 1/(room$CT_percent.workhome)
room$sqrt_price <- sqrt(room$price)
room$inv_price <- 1/(room$price)
newroom <- room
which(room$CT_percent.workhome == min(room$CT_percent.workhome))
## [1] 64
newroom <- newroom[-64,]
LM_log_price_workhome <- lm(log_price~CT_percent.workhome,data=newroom)
ols_plot_resid_stud_fit(LM_log_price_workhome)

summary(LM_log_price_workhome)
## 
## Call:
## lm(formula = log_price ~ CT_percent.workhome, data = newroom)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35308 -0.34492  0.02056  0.31837  1.37310 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.5937     0.0962  47.752  < 2e-16 ***
## CT_percent.workhome   3.9579     0.5930   6.674 4.75e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4587 on 147 degrees of freedom
## Multiple R-squared:  0.2325, Adjusted R-squared:  0.2273 
## F-statistic: 44.54 on 1 and 147 DF,  p-value: 4.751e-10
LM_sqrt_price_workhome <- lm(sqrt_price~CT_percent.workhome,data=newroom)
ols_plot_resid_stud_fit(LM_sqrt_price_workhome)

summary(LM_sqrt_price_workhome)
## 
## Call:
## lm(formula = sqrt_price ~ CT_percent.workhome, data = newroom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8138 -2.4850 -0.1822  1.9163 12.9002 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.6401     0.6854  14.065  < 2e-16 ***
## CT_percent.workhome  28.0230     4.2253   6.632 5.91e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.268 on 147 degrees of freedom
## Multiple R-squared:  0.2303, Adjusted R-squared:  0.2251 
## F-statistic: 43.99 on 1 and 147 DF,  p-value: 5.905e-10
LM_inv_price_workhome <- lm(inv_price~CT_percent.workhome,data=newroom)
ols_plot_resid_stud_fit(LM_inv_price_workhome)

summary(LM_inv_price_workhome)
## 
## Call:
## lm(formula = inv_price ~ CT_percent.workhome, data = newroom)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.004907 -0.002121 -0.000561  0.001486  0.021871 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.0099807  0.0006874  14.520  < 2e-16 ***
## CT_percent.workhome -0.0239590  0.0042374  -5.654  7.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003278 on 147 degrees of freedom
## Multiple R-squared:  0.1786, Adjusted R-squared:  0.173 
## F-statistic: 31.97 on 1 and 147 DF,  p-value: 7.9e-08

We tried different combinations of transforming the predictor variable, the response variable, and both. Most of the options do show any promise as they would either fail the LINE test or have too low of a correlation coefficient. The combination that seems the most promising for this variable is taking the log of the price with the normal predictor. It is important to note that we had to remove a data point that contained a “0” in the percentage working from home. This 0 would have complicated our transformations. We will now analyze our new first model.

New Model 1: Log Price and Percentage of People that Work from home

Scatter plot

percent_home__plot2 <- ggplot(newroom, aes(x=CT_percent.workhome, y=log_price)) +
  geom_point() +
  geom_smooth(method=lm , color="red", se=FALSE) +
  theme_ipsum() +
  xlab("Percentage Working from Home") +
  ylab("Log of Price") +
  ggtitle("Scatterplot Comparing Log of the Price to Home Work Percentage")
percent_home__plot2
## `geom_smooth()` using formula = 'y ~ x'

Linear Model

LM_log_price_workhome
## 
## Call:
## lm(formula = log_price ~ CT_percent.workhome, data = newroom)
## 
## Coefficients:
##         (Intercept)  CT_percent.workhome  
##               4.594                3.958

\[ log(\hat{price}) = 4.594 +3.958(PercentageWorkFromHome) \]

LINE Assumption Analysis

ols_plot_resid_stud_fit(LM_log_price_workhome) 

ols_plot_resid_qq(LM_log_price_workhome) 

ols_test_normality(LM_log_price_workhome) 
## Warning in ks.test.default(y, "pnorm", mean(y), sd(y)): ties should not be
## present for the Kolmogorov-Smirnov test
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9959         0.9523 
## Kolmogorov-Smirnov        0.0404         0.9684 
## Cramer-von Mises         18.3092         0.0000 
## Anderson-Darling          0.2347         0.7893 
## -----------------------------------------------
ols_test_f(LM_log_price_workhome)
## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of log_price 
## 
##       Test Summary        
##  -------------------------
##  Num DF     =    1 
##  Den DF     =    147 
##  F          =    0.5457897 
##  Prob > F   =    0.4612211

This model has met every metric of our LINE assumption. The model seems to be linear from the residual plot. The model also seems to match up well with the QQ plot, with small deviations at both far ends. We can assume that independence is met and has met the equal variance requirement. The corresponding p-value testing for variance was about 46.12%.

Outliers

ols_plot_resid_lev(LM_log_price_workhome)

Since our model met LINE assumptions, we can check for outliers now. From the plot, it is noticeable that there are a lot of leverage points and some outliers, but none are influential points. We can now move onto our second model.

Model 2: Percentage of People Making more than $75,000

Scatter plot

percent_75k__plot <- ggplot(room, aes(x=CT_percent.workhome, y=price)) +
  geom_point() +
  geom_smooth(method=lm , color="red", se=FALSE) +
  theme_ipsum() +
  xlab("Percentage Making above $75,000") +
  ylab("Price") +
  ggtitle("Scatterplot Comparing Price to those making above $75,000 Percentage")
percent_75k__plot
## `geom_smooth()` using formula = 'y ~ x'

Linear Model

LM_price_75k <-lm(price~CT_percent.incomeGt75E,data = room)
LM_price_75k
## 
## Call:
## lm(formula = price ~ CT_percent.incomeGt75E, data = room)
## 
## Coefficients:
##            (Intercept)  CT_percent.incomeGt75E  
##                  94.11                  267.69

\[ \hat{price} = 94.11 +27.69(PercentMakingMoreThan75k) \]

LINE Assumption Analysis

ols_plot_resid_stud_fit(LM_price_75k)

ols_plot_resid_qq(LM_price_75k)

ols_plot_resid_hist(LM_price_75k)

ols_test_normality(LM_price_75k)
## Warning in ks.test.default(y, "pnorm", mean(y), sd(y)): ties should not be
## present for the Kolmogorov-Smirnov test
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.8605         0.0000 
## Kolmogorov-Smirnov        0.1466         0.0032 
## Cramer-von Mises          15.44          0.0000 
## Anderson-Darling          5.1106         0.0000 
## -----------------------------------------------

Although it seems that this would be a good model, it does fail the normality portion of our LINE assumptions as equal variance. We will try new transformations to the formula to make it better. We will used the same transformations as before with those being log, square root, and inverse functions.

Transformations

#adding in transformation columns
room$log_75k <- log(room$CT_percent.incomeGt75E)
room$sqrt_75k <- sqrt(room$CT_percent.incomeGt75E)
room$inv_75k <- 1/(room$CT_percent.incomeGt75E)
LM_log_price_75k <- lm(log_price~CT_percent.incomeGt75E,data=room)
ols_plot_resid_stud_fit(LM_log_price_75k)

summary(LM_log_price_75k)
## 
## Call:
## lm(formula = log_price ~ CT_percent.incomeGt75E, data = room)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33217 -0.30898 -0.02274  0.27814  1.40910 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.5789     0.1191  38.433  < 2e-16 ***
## CT_percent.incomeGt75E   1.4650     0.2738   5.352 3.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4794 on 148 degrees of freedom
## Multiple R-squared:  0.1621, Adjusted R-squared:  0.1565 
## F-statistic: 28.64 on 1 and 148 DF,  p-value: 3.251e-07
LM_sqrt_price_75k <- lm(sqrt_price~CT_percent.incomeGt75E,data=room)
ols_plot_resid_stud_fit(LM_sqrt_price_75k)

summary(LM_sqrt_price_75k)
## 
## Call:
## lm(formula = sqrt_price ~ CT_percent.incomeGt75E, data = room)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0913 -2.3038 -0.5087  1.6584 13.1518 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.8397     0.8591  11.454  < 2e-16 ***
## CT_percent.incomeGt75E   9.6362     1.9740   4.882 2.69e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.457 on 148 degrees of freedom
## Multiple R-squared:  0.1387, Adjusted R-squared:  0.1329 
## F-statistic: 23.83 on 1 and 148 DF,  p-value: 2.695e-06
LM_inv_price_75k <- lm(inv_price~CT_percent.incomeGt75E,data=room)
ols_plot_resid_stud_fit(LM_inv_price_75k)

summary(LM_inv_price_75k)
## 
## Call:
## lm(formula = inv_price ~ CT_percent.incomeGt75E, data = room)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0058980 -0.0020275 -0.0003596  0.0013209  0.0214247 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.0106175  0.0008206  12.938  < 2e-16 ***
## CT_percent.incomeGt75E -0.0102020  0.0018856  -5.411 2.47e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003302 on 148 degrees of freedom
## Multiple R-squared:  0.1651, Adjusted R-squared:  0.1595 
## F-statistic: 29.27 on 1 and 148 DF,  p-value: 2.472e-07

We have done the same thing as the previous model, doing different combinations of transformations to the response and predictor variable. In this case, the one that seems to be linear with the highest correlation of the two is logging the price and keeping the percent of people making more than $75,000 the same. We will now analyze it further.

New Model 2: Percentage of People Making above $75,000 and log of the Price

Scatter plot

percent_75k__plot2 <- ggplot(room, aes(x=CT_percent.workhome, y=log_price)) +
  geom_point() +
  geom_smooth(method=lm , color="red", se=FALSE) +
  theme_ipsum() +
  xlab("Percentage Making above $75,000") +
  ylab("Log of Price") +
  ggtitle("Scatterplot Comparing Log of Price to those making above $75,000 Percentage")
percent_75k__plot2
## `geom_smooth()` using formula = 'y ~ x'

Linear Model

LM_log_price_75k
## 
## Call:
## lm(formula = log_price ~ CT_percent.incomeGt75E, data = room)
## 
## Coefficients:
##            (Intercept)  CT_percent.incomeGt75E  
##                  4.579                   1.465

\[ log(\hat{price}) = 4.579 +1.465(PercentMakingMoreThan75k) \]

LINE Assumptions

ols_plot_resid_stud_fit(LM_log_price_75k)

ols_plot_resid_qq(LM_log_price_75k)

ols_plot_resid_hist(LM_log_price_75k)

ols_test_normality(LM_log_price_75k)
## Warning in ks.test.default(y, "pnorm", mean(y), sd(y)): ties should not be
## present for the Kolmogorov-Smirnov test
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9907         0.4290 
## Kolmogorov-Smirnov        0.0648         0.5541 
## Cramer-von Mises         19.0283         0.0000 
## Anderson-Darling          0.5184         0.1852 
## -----------------------------------------------

After making the transformation, this model fits our LINE assumption much better. The model is linear, is approximately normal except for deviations along the extremes, the residuals are normally distributed, and the points seem to be independent. The corresponding p-value we got from the Shapiro-Wilk test was 0.4290. We will now check for outliers.

Outliers

ols_plot_resid_lev(LM_log_price_75k)

There are clearly some outliers as well as leverage points, but no influential points. We can now move on to comparing our two simple regression models.

Model Comparison

We have been asked to compare our two models based on 5 different criteria. These include the percentage of variability explained, AIC, effect(slope) size, whether the slope is significant, and whether they met LINE assumption. Here are the models we will be using:

Model 1:\[ log(\hat{price}) = 4.594 +3.958(PercentageWorkFromHome) \]

Model 2: \[ log(\hat{price}) = 4.579 +1.465(PercentMakingMoreThan75k) \]

ols_regress(LM_log_price_workhome)
##                          Model Summary                          
## ---------------------------------------------------------------
## R                       0.482       RMSE                 0.456 
## R-Squared               0.233       MSE                  0.210 
## Adj. R-Squared          0.227       Coef. Var            8.848 
## Pred R-Squared          0.212       AIC                194.614 
## MAE                     0.369       SBC                203.625 
## ---------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares         DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression      9.374          1          9.374    44.541    0.0000 
## Residual       30.936        147          0.210                     
## Total          40.310        148                                    
## --------------------------------------------------------------------
## 
##                                      Parameter Estimates                                      
## ---------------------------------------------------------------------------------------------
##               model     Beta    Std. Error    Std. Beta      t        Sig     lower    upper 
## ---------------------------------------------------------------------------------------------
##         (Intercept)    4.594         0.096                 47.752    0.000    4.404    4.784 
## CT_percent.workhome    3.958         0.593        0.482     6.674    0.000    2.786    5.130 
## ---------------------------------------------------------------------------------------------
summary(LM_log_price_workhome)
## 
## Call:
## lm(formula = log_price ~ CT_percent.workhome, data = newroom)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35308 -0.34492  0.02056  0.31837  1.37310 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.5937     0.0962  47.752  < 2e-16 ***
## CT_percent.workhome   3.9579     0.5930   6.674 4.75e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4587 on 147 degrees of freedom
## Multiple R-squared:  0.2325, Adjusted R-squared:  0.2273 
## F-statistic: 44.54 on 1 and 147 DF,  p-value: 4.751e-10
ols_regress(LM_log_price_75k)
##                          Model Summary                          
## ---------------------------------------------------------------
## R                       0.403       RMSE                 0.476 
## R-Squared               0.162       MSE                  0.230 
## Adj. R-Squared          0.156       Coef. Var            9.252 
## Pred R-Squared          0.141       AIC                209.078 
## MAE                     0.380       SBC                218.110 
## ---------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares         DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression      6.581          1          6.581     28.64    0.0000 
## Residual       34.009        148          0.230                     
## Total          40.590        149                                    
## --------------------------------------------------------------------
## 
##                                       Parameter Estimates                                        
## ------------------------------------------------------------------------------------------------
##                  model     Beta    Std. Error    Std. Beta      t        Sig     lower    upper 
## ------------------------------------------------------------------------------------------------
##            (Intercept)    4.579         0.119                 38.433    0.000    4.343    4.814 
## CT_percent.incomeGt75E    1.465         0.274        0.403     5.352    0.000    0.924    2.006 
## ------------------------------------------------------------------------------------------------
summary(LM_log_price_75k)
## 
## Call:
## lm(formula = log_price ~ CT_percent.incomeGt75E, data = room)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33217 -0.30898 -0.02274  0.27814  1.40910 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.5789     0.1191  38.433  < 2e-16 ***
## CT_percent.incomeGt75E   1.4650     0.2738   5.352 3.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4794 on 148 degrees of freedom
## Multiple R-squared:  0.1621, Adjusted R-squared:  0.1565 
## F-statistic: 28.64 on 1 and 148 DF,  p-value: 3.251e-07
  1. Percentage of variability explained:

    Model 1 has a R-squared of 0.233 while model 2 has a R-squared of 0.162. In terms of just using R-squared, model 1 is much better than model 2. Model 1 explains about 23.3% of the variation of price in this sample.

  2. AIC:

    Model 1 has an AIC of 194.614 and model 2 has an AIC of 209.078. Model 1 wins here again with the smaller AIC value.

  3. Effect (slope) size: Since both models are just the log of the response variable and normal predictor, the slope and intercept in both models are in the same units. Model 1 has a slope of 3.958 log dollars for every 1% increase in the percentage of people that work from home while model two is about half that at 1.465 log dollars for every 1% increase in people making more than $75,000. Model 1 has the larger slope.

  4. Is the slope significant:

    After doing a t-test for the slope of model 1 and model 2, we find that both have p-values of essentially 0. This means that both slopes are significant.

  5. Does the model meet LINE:

    As done before, it is shown that both models met LINE after transformations.

Conclusion

After comparing the two models, we have decided to use model 1 as our model because it explains more of the variability, has a lower AIC, and the slope is higher than model 2.

Multiple Linear Regression

We will move onto multiple linear regression models. We will begin with a “full model” that has about 10 variables and then use best subsets to narrow it down further. The ten variables that I have here are a mixture of variables that proved a higher correlation in the correlation matrix before and factors that I believe could work together. We will also be using the log of price in this model as the two previous models showed promise whenever that transformation was applied.

names(room)
##  [1] "...1"                       "host_name"                 
##  [3] "neighbourhood"              "room_type"                 
##  [5] "price"                      "minimum_nights"            
##  [7] "number_of_reviews"          "reviews_per_month"         
##  [9] "Number_Listings_by_Host"    "availability_365"          
## [11] "num_trees.500m"             "num_bikehire.500m"         
## [13] "num_burglary.500m"          "num_public_art.500m"       
## [15] "num_motor_theft.500"        "CensusTract_GEOID"         
## [17] "CT_population_density.km2"  "CT_median_age"             
## [19] "CT_percent.incomeGt75E"     "CT_percent.under18"        
## [21] "CT_percent.over65"          "CT_percent.poverty"        
## [23] "CT_percent.foreignborn"     "CT_percent.workhome"       
## [25] "CT_percent.withdegree"      "CT_percent.collegestudents"
## [27] "CT_gini_inequalityindex"    "CT_median.housevalue"      
## [29] "CT_median.rent"             "CT_percenthouse.rented"    
## [31] "CT_percenthouse.vacant"     "CT_percenthouse.broadband" 
## [33] "Longitude"                  "Latitude"                  
## [35] "log_workhome"               "log_price"                 
## [37] "sqrt_workhome"              "inv_workhome"              
## [39] "sqrt_price"                 "inv_price"                 
## [41] "log_75k"                    "sqrt_75k"                  
## [43] "inv_75k"
fullmodel <- lm(log_price~reviews_per_month+number_of_reviews+availability_365+CT_percent.incomeGt75E+CT_percent.workhome+CT_percent.withdegree+CT_population_density.km2+CT_median_age+CT_percenthouse.vacant+minimum_nights,data = room)

BestSubsets <- ols_step_best_subset(fullmodel)
BestSubsets
##                                                                                              Best Subsets Regression                                                                                              
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## Model Index    Predictors
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##      1         CT_percent.workhome                                                                                                                                                                                 
##      2         CT_percent.workhome minimum_nights                                                                                                                                                                  
##      3         number_of_reviews CT_percent.workhome minimum_nights                                                                                                                                                
##      4         number_of_reviews CT_percent.workhome CT_population_density.km2 minimum_nights                                                                                                                      
##      5         number_of_reviews CT_percent.workhome CT_population_density.km2 CT_percenthouse.vacant minimum_nights                                                                                               
##      6         number_of_reviews CT_percent.incomeGt75E CT_percent.workhome CT_population_density.km2 CT_percenthouse.vacant minimum_nights                                                                        
##      7         number_of_reviews CT_percent.incomeGt75E CT_percent.workhome CT_percent.withdegree CT_population_density.km2 CT_percenthouse.vacant minimum_nights                                                  
##      8         number_of_reviews CT_percent.incomeGt75E CT_percent.workhome CT_percent.withdegree CT_population_density.km2 CT_median_age CT_percenthouse.vacant minimum_nights                                    
##      9         reviews_per_month number_of_reviews CT_percent.incomeGt75E CT_percent.workhome CT_percent.withdegree CT_population_density.km2 CT_median_age CT_percenthouse.vacant minimum_nights                  
##     10         reviews_per_month number_of_reviews availability_365 CT_percent.incomeGt75E CT_percent.workhome CT_percent.withdegree CT_population_density.km2 CT_median_age CT_percenthouse.vacant minimum_nights 
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## 
##                                                     Subsets Regression Summary                                                    
## ----------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                            
## Model    R-Square    R-Square    R-Square     C(p)        AIC         SBIC         SBC        MSEP       FPE       HSP       APC  
## ----------------------------------------------------------------------------------------------------------------------------------
##   1        0.2377      0.2326      0.2178    41.2339    194.8929    -231.6632    203.9248    31.3580    0.2118    0.0014    0.7829 
##   2        0.3655      0.3568      0.3394    11.8618    169.3845    -256.5255    181.4270    26.2825    0.1787    0.0012    0.6604 
##   3        0.4008      0.3885      0.3573     5.1833    162.7909    -262.7383    177.8441    24.9903    0.1710    0.0011    0.6320 
##   4        0.4300      0.4143       0.381     0.0101    157.2963    -267.6819    175.3601    23.9374    0.1649    0.0011    0.6093 
##   5        0.4310      0.4112       0.375     1.7747    159.0439    -265.7705    180.1183    24.0643    0.1668    0.0011    0.6165 
##   6        0.4323      0.4085      0.3689     3.4351    160.6790    -263.9533    184.7641    24.1749    0.1686    0.0011    0.6232 
##   7        0.4336      0.4057      0.3631     5.1304    162.3508    -262.0930    189.4465    24.2931    0.1705    0.0011    0.6302 
##   8        0.4340      0.4019      0.3573     7.0296    164.2421    -260.0316    194.3484    24.4489    0.1727    0.0012    0.6383 
##   9        0.4341      0.3977      0.3487     9.0081    166.2189    -257.8937    199.3359    24.6210    0.1750    0.0012    0.6468 
##  10        0.4341      0.3934      0.3361    11.0000    168.2102    -255.7429    204.3378    24.7980    0.1774    0.0012    0.6555 
## ----------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

Of the 10 different combinations of variables that we could use, the fourth model using number of reviews, percentage of people that work from home, population density, and minimum nights.

finalmodel <- lm(log_price~number_of_reviews+CT_percent.workhome+CT_population_density.km2+minimum_nights,data=room)
finalmodel
## 
## Call:
## lm(formula = log_price ~ number_of_reviews + CT_percent.workhome + 
##     CT_population_density.km2 + minimum_nights, data = room)
## 
## Coefficients:
##               (Intercept)          number_of_reviews  
##                 5.090e+00                 -1.138e-03  
##       CT_percent.workhome  CT_population_density.km2  
##                 3.255e+00                 -9.210e-06  
##            minimum_nights  
##                -1.180e-02

Summary

Model 3:

\[ log(\hat{price}) = 5.09 -0.001138(NumberofReviews)+3.255(PercentageWorkFromHome)-0.00000921(PopulationDensity(km2))-0.018(MinimumNights) \]

ols_plot_resid_stud_fit(finalmodel)

ols_plot_resid_qq(finalmodel)

ols_plot_resid_hist(finalmodel)

ols_test_normality(finalmodel)
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9927         0.6491 
## Kolmogorov-Smirnov        0.0449         0.9224 
## Cramer-von Mises         21.6152         0.0000 
## Anderson-Darling          0.3216         0.5263 
## -----------------------------------------------
ols_test_f(finalmodel)
## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of log_price 
## 
##        Test Summary        
##  --------------------------
##  Num DF     =    1 
##  Den DF     =    148 
##  F          =    0.02279498 
##  Prob > F   =    0.880197
ols_plot_resid_lev(finalmodel)

This multiple linear regression model would meet every criteria of line. There does not seem to be evidence of a curve, we can assume independence, is approximately normal with slight deviations along the extremes, and the equal variance p-value we got from the f test was 0.88. This means that we cannot reject the hypothesis that the variance is homogeneous.There do seem to be outliers and leverage points along with two influential points. Those data points may need to be investigated further. Whenever we tried to remove those influential points, the new model would still have influential points in it.

summary(finalmodel)
## 
## Call:
## lm(formula = log_price ~ number_of_reviews + CT_percent.workhome + 
##     CT_population_density.km2 + minimum_nights, data = room)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33415 -0.24587 -0.01494  0.27183  1.14868 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.090e+00  1.122e-01  45.343  < 2e-16 ***
## number_of_reviews         -1.138e-03  3.431e-04  -3.318  0.00115 ** 
## CT_percent.workhome        3.255e+00  5.188e-01   6.273 3.84e-09 ***
## CT_population_density.km2 -9.210e-06  3.379e-06  -2.726  0.00721 ** 
## minimum_nights            -1.180e-02  1.945e-03  -6.067 1.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3994 on 145 degrees of freedom
## Multiple R-squared:   0.43,  Adjusted R-squared:  0.4143 
## F-statistic: 27.35 on 4 and 145 DF,  p-value: < 2.2e-16

If we take a look at the summary, all the models have a p value that would be significant. The percentage of those working from home and minimum nights are a bit more significant than the number of reviews and population density. The percentage that works from home seems to have the biggest effect on the model with its slope being 3.255.